Core Lexicon and Contagious Words
We present a new empirical parameter: the most probable usage frequency of a
word in a language, computed via the distribution of documents over the
frequency of that word. This parameter allows the core lexicon of a language
to be filtered out from the content words, which tend to be extremely
frequent in some texts written in specific genres or by certain authors. For
such words, the distribution of documents over frequencies displays a long
tail, representing the set of documents in which they are used in abundance.
Collections of such documents exhibit a percolation-like phase transition as
the coarse grain of frequency (which flattens out the strongly irregular
frequency data series) approaches a critical value.
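
The central quantity admits a short illustration. The following is a minimal
sketch in Python (our own, not the authors' code) that estimates the most
probable usage frequency as the mode of the distribution of documents over a
word's relative frequency; the binning parameter n_bins is a hypothetical
stand-in for the coarse-graining step.

    from collections import Counter

    def most_probable_frequency(docs, word, n_bins=50):
        """Mode of the per-document relative-frequency distribution of `word`.

        `docs` is a list of token lists; `n_bins` controls the coarse grain
        of the frequency axis (an assumed knob, mirroring the abstract).
        """
        freqs = [t.count(word) / len(t) for t in docs if t]
        if not freqs or max(freqs) == 0.0:
            return 0.0
        width = max(freqs) / n_bins        # bin width after coarse-graining
        bins = Counter(min(int(f / width), n_bins - 1) for f in freqs)
        mode_bin = bins.most_common(1)[0][0]
        return (mode_bin + 0.5) * width    # bin centre as the modal frequency

On this picture, a core-lexicon word yields a mode near its pooled corpus
frequency, while a contagious word's long tail leaves the mode well below its
raw count.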
Syntactic Knowledge via Graph Attention with BERT in Machine Translation
Although the Transformer model can effectively acquire context features via
its self-attention mechanism, deeper syntactic knowledge is still not
effectively modeled. To alleviate this problem, we propose Syntactic
knowledge via Graph attention with BERT (SGB) for Machine Translation (MT)
scenarios. A Graph Attention Network (GAT) and BERT jointly represent
syntactic dependency features as explicit knowledge of the source language,
enriching source-language representations and guiding target-language
generation. Our experiments use gold syntax-annotated sentences and a
Quality Estimation (QE) model to interpret the improvement in translation
quality attributable to syntactic knowledge, rather than relying on BLEU
scores alone. Experiments show that the proposed SGB engines improve
translation quality across all three MT tasks without sacrificing BLEU
scores. We investigate which source sentence lengths benefit most and which
dependencies are better identified by the SGB engines. We also find that
GAT's learning of specific dependency relations is reflected in the
translation quality of sentences containing such relations, and that syntax
on the graph leads to new modeling of syntactic aspects of source sentences
in the middle and bottom layers of BERT.
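
To make the combination concrete, here is a minimal sketch (not the actual
SGB implementation) of a single-head GAT-style layer that refines BERT token
states along dependency edges; the hidden size of 768 and the single-head,
single-layer setup are our assumptions.

    import torch
    import torch.nn.functional as F

    class DependencyGATLayer(torch.nn.Module):
        """One GAT-style attention layer restricted to dependency edges."""

        def __init__(self, dim=768):
            super().__init__()
            self.proj = torch.nn.Linear(dim, dim, bias=False)
            self.attn = torch.nn.Linear(2 * dim, 1, bias=False)

        def forward(self, hidden, adj):
            # hidden: (n, dim) BERT token states; adj: (n, n) 0/1 dependency
            # mask, which must include self-loops so every row attends somewhere.
            h = self.proj(hidden)
            n = h.size(0)
            pairs = torch.cat(
                [h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)],
                dim=-1,
            )
            e = F.leaky_relu(self.attn(pairs).squeeze(-1))  # e_ij = a([h_i ; h_j])
            e = e.masked_fill(adj == 0, float("-inf"))      # attend only along edges
            alpha = torch.softmax(e, dim=-1)
            return F.elu(alpha @ h)                         # syntax-enriched states

    # Toy usage with random stand-ins for BERT states and a tiny parse graph.
    states = torch.randn(4, 768)
    adj = torch.eye(4)
    adj[0, 1] = adj[1, 0] = adj[1, 2] = adj[2, 1] = 1.0
    enriched = DependencyGATLayer()(states, adj)            # shape (4, 768)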
GATology for Linguistics: What Syntactic Dependencies It Knows
Graph Attention Network (GAT) is a graph neural network that offers one
strategy for modeling and representing explicit syntactic knowledge, and it
can work with pre-trained models, such as BERT, in downstream tasks.
Currently, there is still a lack of investigation into how GAT learns
syntactic knowledge from the perspective of model structure. Moreover, as one
strategy for modeling explicit syntactic knowledge, the combination of GAT
and BERT has never been applied and discussed in Machine Translation (MT)
scenarios. We design a dependency relation prediction task to study how GAT
learns the syntactic knowledge of three languages as a function of the number
of attention heads and layers. We also use a paired t-test and F1-scores to
clarify the differences in syntactic dependency prediction between GAT and
BERT fine-tuned on the MT task (MT-B). The experiments show that better
performance can be achieved by appropriately increasing the number of
attention heads with two GAT layers; with more than two layers, learning
suffers. Moreover, GAT is more competitive than MT-B in training speed and
syntactic dependency prediction, which may indicate a better way to
incorporate explicit syntactic knowledge and the potential of combining GAT
and BERT in MT tasks.
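
The comparison protocol is easy to sketch. Below is a hedged illustration
(scipy and scikit-learn are assumed available; the data layout is
hypothetical) of contrasting the per-sentence dependency-prediction F1 of GAT
and MT-B with a paired t-test:

    from scipy.stats import ttest_rel
    from sklearn.metrics import f1_score

    def compare_dependency_predictors(gold, pred_gat, pred_mtb):
        """gold/pred_*: one list of dependency-relation labels per sentence."""
        f1_gat = [f1_score(g, p, average="micro") for g, p in zip(gold, pred_gat)]
        f1_mtb = [f1_score(g, p, average="micro") for g, p in zip(gold, pred_mtb)]
        stat, p_value = ttest_rel(f1_gat, f1_mtb)  # paired over the same sentences
        return sum(f1_gat) / len(f1_gat), sum(f1_mtb) / len(f1_mtb), p_value

A small p-value together with a higher mean F1 for GAT would support the
claim that GAT is the stronger dependency predictor.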
A Robust Statistical Model of Word Frequencies
Paper presented at the 5th Strathmore International Mathematics Conference
(SIMC 2019), 12-16 August 2019, Strathmore University, Nairobi, Kenya.
For the purposes of language teaching or automatic language processing, it is
important to know how frequent a word is. However, a simple procedure that
counts the number of times a word occurs in a collection of texts leads to
many unfortunate artefacts, because some words occur far too often in a small
number of texts, producing frequency bursts. Our task in this paper is to
introduce a statistical model that uses methods from robust statistics to
estimate the frequencies of words in a collection of texts.
University of Leeds, United Kingdom
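
As a minimal illustration of the burst problem and a robust fix (our sketch,
not the paper's exact model), compare a pooled count with the median of
per-text relative frequencies:

    import statistics

    def naive_frequency(texts, word):
        """Pooled relative frequency; one bursty text dominates the estimate."""
        total = sum(len(t) for t in texts)
        return sum(t.count(word) for t in texts) / total if total else 0.0

    def robust_frequency(texts, word):
        """Median of per-text relative frequencies; resistant to bursts."""
        per_text = [t.count(word) / len(t) for t in texts if t]
        return statistics.median(per_text) if per_text else 0.0

    # A word used heavily in a single text barely moves the robust estimate.
    texts = [["the", "cat"], ["the", "dog"], ["hobbit"] * 100]
    print(naive_frequency(texts, "hobbit"))   # ~0.96, dominated by the burst
    print(robust_frequency(texts, "hobbit"))  # 0.0, burst confined to one text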
BERT goes off-topic: investigating the domain transfer challenge using genre classification
While the performance of many text classification tasks has recently improved due to Pre-trained Language Models (PLMs), in this paper we show that they still suffer from a performance gap when the underlying distribution of topics changes. For example, a genre classifier trained on political topics often fails when tested on documents about sport or medicine. In this work, we quantify this phenomenon empirically with a large corpus and a large set of topics. Consequently, we verify that domain transfer remains challenging both for classic PLMs, such as BERT, and for modern large models, such as GPT-3. We also suggest and successfully test a possible remedy: after augmenting the training dataset with topically-controlled synthetic texts, the F1 score improves by up to 50% for some topics, nearing on-topic training results, while others show little to no improvement. While our empirical results focus on genre classification, our methodology is applicable to other classification tasks such as gender, authorship, or sentiment classification. The code and data to replicate the experiments are available at https://github.com/dminus1/genr
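
The transfer gap itself is straightforward to measure. The sketch below (our
illustration: a TF-IDF pipeline stands in for BERT or GPT-3, and the argument
names are assumptions) trains a genre classifier on one topic's documents and
scores it on another's:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.pipeline import make_pipeline

    def cross_topic_f1(train_docs, train_genres, test_docs, test_genres):
        """Fit a genre classifier on one topic; evaluate F1 on another."""
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(train_docs, train_genres)
        return f1_score(test_genres, clf.predict(test_docs), average="macro")

Comparing on-topic with off-topic scores exposes the gap; augmenting
train_docs with topically-controlled synthetic texts is the remedy the paper
tests.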